A Post by Michael B. Spring

“Structured Documents” – Concept and Form (October 10, 2008)

I was in a discussion with my PhD students this week and the subject of structured documents came up. I was flabbergasted by some of the thoughts that were expressed and by the lack of agreement about what was meant by a structured document, both conceptually and technically. In this posting, I would like to address the issue of structured documents. In my conclusion, I begin what will be another discussion about the appropriate level of document structuring.

A Couple Caveats

More frequently than I would like, students mention memex, Bush, Engelbart, NLS, hypertext, and the World Wide Web when they start to talk about structured documents. While hypertext documents are interesting, and do have a structure associated with them, they have very little to do with “structured documents” as I understand and talk about them. Using a document style or theme in Word does not result in a structured document – it results in a document that is styled, not structured. A form can be a structured document, but forms per se are often more like records than structured documents. The World Wide Web is not the origin of structured documents, but structured documents do become more important in the context of the World Wide Web. Technically, an HTML document is a structured document, but saying so is a little like saying that three kids arguing in a school yard are a legislative body.

Let’s begin at a very simple level. We might make an argument that any string is a structured document. Here I use the term string technically – it is a sequence of characters that may include control characters such as tabs and newlines. In this case, each character has a position in the string. We can say some things about the number of words in the string, the number of lines, the number of sentences and paragraphs (which gets somewhat complicated), etc. There are many possibilities at this low level, and while a string is a structure, it does not define what we mean by a structured document. Ok, let’s now move back through the history of the written word and take a look at structured documents in two eras, the non-digital era and the digital era.
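
The low-level, string view of a document described above can be sketched in a few lines of Python. This is only an illustration using naive heuristics of my own choosing (the function name string_stats and the splitting rules are assumptions, not a standard): words are whitespace-separated, paragraphs are separated by blank lines, and sentences end at a period, question mark, or exclamation point followed by whitespace.

```python
import re

def string_stats(text: str) -> dict:
    """Count low-level units in a plain string using naive heuristics."""
    lines = text.split("\n")
    words = text.split()
    # Paragraphs: runs of text separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Sentences: naive split after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return {
        "lines": len(lines),
        "words": len(words),
        "sentences": len(sentences),
        "paragraphs": len(paragraphs),
    }

sample = "Dr. Smith wrote a memo. It was short.\n\nNobody read it."
stats = string_stats(sample)
print(stats)
```

The sample holds three sentences, but the naive rule also splits after the abbreviation “Dr.” and reports four, which is one small instance of why counting sentences gets complicated.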

Structured Documents in the Pre-Digital Era

We might imagine a document such as the Iliad or one of the books of the Bible. We have a story from the oral tradition that is written down as remembered. It is a stream of characters, or words. It tells a story. It has a beginning and an end. Surely it is logically structured. It may or may not have a title. It may or may not have the signature or the name of an author. It may or may not have parts. If the story is well told, one suspects that it is conceptually well structured. What makes a “good” story is in large part the quality of the flow in the story. Given that it is a transcription of what originated as an oral presentation, it may have little literary structure. As documents, these manuscripts, and many manuscripts produced before mass production printing, have little formal structure.

Fast forward to a modern textbook. It has a title page. The title page has the title of the book, the author or authors, the publisher, and the city or cities in which the publisher has offices. The title may consist of a title and subtitle, but the title is singular. There is one publisher. There may be one or many authors. The title page is followed by a “cataloging page”, which my colleagues tell me is simply known as the “verso of the Title Page”, that contains among other things disclaimers, publisher address information, ISBN, Library of Congress cataloging information, copyright, etc. These pages may be preceded by advertising pages, and are followed by dedication pages, forewords, table of contents, acknowledgements, and then the book proper. Normally, the book is made up of a series of chapters, but it may consist of “parts” which have chapters, and the chapters may have sections and subsections. Within these structures, there can be paragraphs, figures, tables, examples, etc.

A modern textbook is a highly structured document. Over time, books have taken on a more structured form. A textbook tends to be more highly structured than a trade book. A scholarly journal tends to be more highly structured than a magazine. A business letter tends to be more highly structured than a personal letter. An academic CV tends to be more highly structured than a resume. In the real world, documents have differing levels of structure that are appropriate to the purpose of the document. The source and force of that structure vary greatly. The source may be regulatory, contractual, or consensual. The form of proposals for government funding is a matter of regulation. The provision of information to be included in a textbook by publisher Y is contractual. The information contained in a course syllabus, at least at my institution, is more a matter of general consensus. In all of these cases, no judgment is made about the appropriateness or sensibility of this structuring.

Structuring of Digital Documents

The history of the application of computer technology to document processing is long and complex. For the purpose of this discussion, I will divide it into four eras. The first era is the digital typesetting era. This era tends to be associated with procedural copy marking. In conventional publishing, layout editors knew how they would “structure” a textbook. This structure was reflected by the graphical layout of the “elements”. Layout editors learned that the title page was to be a recto page – generally the first page of the book. The index went at the back of the book and generally used two columns and a type size smaller than the body type in the book. Computer scientists worked to develop languages to instruct computerized typesetters to change fonts, margins, horizontal and vertical alignment, spacing, etc. Just as a layout editor would place layout copy marks in the manuscript for the typesetter, the user of early formatting software placed procedural commands in the text file. The high point in this era may well have been the development of TeX by Donald Knuth.

The second era is the heterogeneous device era. There was a period of time when line printers, laser printers, CRT screens, dot matrix printers, typesetters, and robot typewriters all coexisted. During this period, the procedural languages (Script, RUNOFF, TeX, and a slew of PC-based languages such as WordStar and PeachText) were evolving into macro languages such as GML, XICS, and LaTeX. This era also saw the emergence of structural copymarking. In some ways, the difference between macros and structural copymarking is marginal. In other ways, it is very significant. Some people credit Charles Goldfarb with the “discovery” of structural copymarking and structured documents. I tend to credit Brian Reid, who developed Scribe. Here’s the deal. A macro says that @title is associated with a set of procedural copymarks. It is easier to remember than all of the details, and it adds some standardization. A structural copymark says that @title is a copymark that can appear in a unit of the document called a @titlepage, and not elsewhere. Brian Reid developed Scribe to allow users to output their work to multiple devices. Thus, he created macros for each of the devices with common names. Then, he decided to go a step further. He developed what he called make files that contained information about the components in about a dozen types of documents – letter, slide, report, article, manual, etc. It was also possible, although very difficult, to specify new types of documents.

Here’s the important thing. Generally speaking, a macro was designed to aggregate procedural copymarks and execute them all at once. So, sometext @macro othertext would result in output where othertext came after sometext but was in a different style. In Scribe, you would say @chapter(SomeText) and that would cause SomeText to start on a new page, be set in some particular font, AND be saved to create a table of contents for the document. Similarly, text@footnote[moretext] and text would cause “text and text” to be output with a superscripted number after the first text and “moretext” to be output at the bottom of the page with a matching superscript. Reid was in part responding to the need to deal with heterogeneous devices and was trying to make his “descriptive markup” easier to use. While GML and SGML get a lot of the credit, appropriately in terms of the standardization and generalization of the concept, Reid’s Scribe provided the earliest functional effort to develop structured documents that were much more than simple macro languages.
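
The difference between a macro and a structural copymark can be sketched in Python. This is a toy illustration, not Scribe: the formatting commands, the MACROS table, and the StructuredFormatter class are all hypothetical names invented for the example. The macro only expands into procedural commands, while the structural marks also record what the text is, so that a table of contents and numbered footnotes can be produced later.

```python
from dataclasses import dataclass, field

# Hypothetical procedural commands; not actual Scribe or TeX syntax.
MACROS = {"title": ["font=bold", "size=18", "align=center"]}

def expand_macro(name: str, text: str) -> str:
    """A macro only aggregates procedural commands in front of the text."""
    return f"<{' '.join(MACROS[name])}>{text}"

@dataclass
class StructuredFormatter:
    """A structural mark formats the text AND records its role in the document."""
    output: list = field(default_factory=list)
    toc: list = field(default_factory=list)
    footnotes: list = field(default_factory=list)

    def chapter(self, title: str) -> None:
        number = len(self.toc) + 1
        self.output.append(f"[new page] {number}. {title}")
        self.toc.append((number, title))  # saved for the table of contents

    def text(self, body: str) -> None:
        self.output.append(body)

    def footnote(self, note: str) -> None:
        number = len(self.footnotes) + 1
        self.footnotes.append((number, note))  # printed at the page bottom later
        if not self.output:
            self.output.append("")
        self.output[-1] += f"[{number}]"  # reference mark after the preceding text

doc = StructuredFormatter()
doc.chapter("Structured Documents")
doc.text("text")
doc.footnote("moretext")
doc.text("and text")

print(expand_macro("title", "Memo"))
print(doc.output)
print(doc.toc)
print(doc.footnotes)
```

The macro call leaves nothing behind except styled output; the chapter and footnote calls leave entries in toc and footnotes that a later pass can use.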

The third era is the WYSIWYG era. While the WYSIWYG era brought on by the development of the Alto and the STAR at the Xerox Palo Alto Research Center did a great many things to make our lives better, the bitmapped screen and the laser printer caused some problems as well. Because these were “all points addressable” devices, there was no need to deal with the different idiosyncrasies of different devices. Whatever you could put on the screen could be printed. In this era, styles technology was dominant. It was now possible to select a different font and type size for each word on the screen. Sweep out some selection of text, and make it the same as some other selection of text. It became possible to do infinitely complex procedural copy marking without ever knowing one of the commands. It was – is – a struggle to get people to use styles consistently. It is just too easy to do anything we damn well please. The power of descriptive or structural copy marking was lost on those who now could do anything they wanted.

The fourth era is the WWW era. As Berners-Lee moved forward the concept of a universal repository for information, he developed a mechanism for identifying resources – the URL, a mechanism for transporting requests and responses – HTTP, and a mechanism for representing resources – HTML. Berners-Lee was familiar with SGML because technical papers at CERN were formatted using an SGML Document Type Definition (DTD). He decided that he could write a DTD that could be used to represent documents in his system. It is not clear whether he actually intended to develop a universal document type or just the first and simplest of what would be many document types. It is clear that his document type was minimalist – a valid HTML document need only include a title element in the head, but most browsers were happy with less than that – i.e. nothing. Further, while it may not have been his intention, users used his elements as macros rather than as descriptive or structural markup. Tables were used to establish formats and block quotes served to indent whole documents. The abuses were many. It quickly became clear that more was needed if we were to be able to unambiguously identify the author, publisher, publication date and structure of these resources. Thus, we began the process of revising SGML to occupy a smaller, more applicable footprint to solve some of the semantic issues related to resources on the web. Many, many coordinated pieces would be required to make this work.

The Current State of Structured Documents

XML and the family of XML standards – xslt, xpath, xsl-fo, schema, schema datatype, xlink, xforms, xquery, etc. – provide the standards, specifications, and technologies to create structured documents of varying levels. As most readers will know, this is done by specifying a schema. Schemas are hard to understand, and even the XML Schema Compact Syntax (XSCS) (Wilde, Erik, and Stillhard, Kilian. A Compact XML Schema Syntax. In Proceedings of XML Europe 2003. London. May 2003. (http://dret.net/netdret/docs/wilde-xmleurope2003.html)) can be daunting. Without endeavoring to specify a new syntax, let me suggest a simpler way to imagine a document type. Using simplified notations based on regular expressions, Backus Naur form, and the original SGML DTD syntax, let me specify the following rules:

The first line of the definition specifies the name of the document
The second and successive lines use the BNF form to define ever finer components of the document. What precedes the ::= is the object being defined; what follows is the definition.
The left side of each line specifies the element(s) being defined.
The right side of each line provides the model for the element(s)
The element model will consist of elements or primitives
While primitives could be of many types and be extensible, we will restrict ourselves here to three primitives {string}, {number}, and {date}
To be complete, every element must ultimately be defined in terms of primitives
The element model uses parentheses to group the components
The components of the model must be connected using one of three connectors – ‘,’ indicating a sequence, ‘|’ indicating a choice, or ‘&’ indicating that the members of the set of components so connected may be in any order.
Each component of the model must be modified by two numbers in brackets, where [1,1] means required exactly once and [n,m] means at least n times and at most m times. The second number, m, may be unbounded, in which case a * is used.
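
As a minimal sketch of the last rule, the occurrence indicator on a single component can be read mechanically. The Python below is illustrative only: the names Component and parse_component are hypothetical, and None is my own stand-in for the unbounded *.

```python
import re
from typing import NamedTuple, Optional

class Component(NamedTuple):
    name: str
    min_occurs: int
    max_occurs: Optional[int]  # None stands for the unbounded "*"

# A name followed by [n,m], where m is a number or "*".
COMPONENT = re.compile(r"\s*(\w+)\s*\[\s*(\d+)\s*,\s*(\d+|\*)\s*\]\s*")

def parse_component(text: str) -> Component:
    """Parse one component such as 'to[1,*]' into its name and bounds."""
    match = COMPONENT.fullmatch(text)
    if match is None:
        raise ValueError(f"not a component: {text!r}")
    name, low, high = match.groups()
    return Component(name, int(low), None if high == "*" else int(high))

required = parse_component("from[1,1]")
repeatable = parse_component("to[1,*]")
optional = parse_component("addenda[0,1]")
```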

Thus a simple definition might be given as follows:

Document type ::= Memo
Memo ::= headings[1,1], body[1,1], addenda[0,1]
headings ::= to[1,*] , from[1,1] , (date[1,1] & subject[1,1] )
body ::= paragraph[1,*]
addenda ::= cc[0,*] & enc[0,1]
to, from, subject, paragraph, cc, enc::= {string}
date::={date}

In this example we define a memo as having headings (required and only once), body (required and only once), and addenda (optional but not more than once) sections in that order. The headings component contains one or more to components followed by one from component followed by one date and one subject, where the last two can be in either order. All of the lowest-level components are defined as strings, except date, which is defined as a date, and none has further sub components. Even in such a simple example, there are many design decisions and complexities such as:

Granularity of component modeling: I could have specified that to or from was a person or an organization. We then could have specified further component structure – such as a person who has a lastname, firstname, and optional title.
Semantics of components: I chose to use semantics based on common terms used in memos. I might have chosen to use author instead of from and recipient instead of to. Keep in mind that we could still generate a memo using “To:” and “From:” followed by the strings in the recipient and author components.
Structuring of the canonical form: I used the general form of the components as we would see them in a typical memo. XSLT allows us to have presentation forms that differ from the canonical structure. Thus, I might have constructed my memo with two components – metadata and content. The cc and enc components could have been in the metadata and the body in the content. For presentation purposes, parts of the metadata could be extracted and placed after the content.
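
Content models of this kind are essentially regular expressions over the names of child components, so the memo definition can be checked with a short sketch. The Python below is an illustration under that assumption, not a schema processor: the helper names (occurs, seq, any_order, is_valid) are hypothetical, and it checks only the order and number of children, not the {string} and {date} primitives.

```python
import re
from itertools import permutations

def occurs(name, low, high):
    """Regex for one named component repeated low..high times (None = unbounded)."""
    upper = "" if high is None else str(high)
    return f"(?:{name} ){{{low},{upper}}}"

def seq(*parts):
    """The ',' connector: parts must appear in the given order."""
    return "".join(parts)

def any_order(*parts):
    """The '&' connector: parts may appear in any order."""
    return "(?:" + "|".join("".join(p) for p in permutations(parts)) + ")"

# The memo definition, one content model per element.
MODELS = {
    "memo": seq(occurs("headings", 1, 1), occurs("body", 1, 1),
                occurs("addenda", 0, 1)),
    "headings": seq(occurs("to", 1, None), occurs("from", 1, 1),
                    any_order(occurs("date", 1, 1), occurs("subject", 1, 1))),
    "body": occurs("paragraph", 1, None),
    "addenda": any_order(occurs("cc", 0, None), occurs("enc", 0, 1)),
}

def is_valid(element, children):
    """Check a list of child names against the element's content model."""
    tokens = "".join(f"{child} " for child in children)
    return re.fullmatch(MODELS[element], tokens) is not None

print(is_valid("headings", ["to", "to", "from", "subject", "date"]))
print(is_valid("headings", ["from", "to", "date", "subject"]))
print(is_valid("memo", ["headings", "body"]))
```

A finer-grained model, such as a person with a lastname and firstname, would simply add more entries to MODELS; the design questions in the list above are about which entries are worth adding.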

Further, these design decisions – as with most things in our document world – cannot be made in the abstract, but must reflect the consensus of the users of these types of documents lest they be ignored. Consider for example a structured document definition of a syllabus. Make it too structured and faculty will ignore it. Make it such that it constrains nobody and likely it will have minimal useful structure. I can imagine a thousand useful things that might be done if syllabi for all college courses used some standard format that contained a reasonable level of required detailed content. I can also imagine that the faculty in one small department of information science would not be able to come to an agreement as to what it should contain. Imagine getting all of the faculty in all of the departments of all of the colleges and universities in the United States to agree!

Appropriate Structuring

So, what is appropriate structuring? I begin with a simple classification of document types. Documents can be personal, group, organizational, enterprise (cross organizational), and archival. Documents may migrate from one category to another. At the personal document end of the continuum, the demand for enforced structure is minimal. I write a note to myself in any form I care to. My diary can be kept as I please. On the other hand, documents that need to be exchanged between organizations A, B, and Z need more structure. Consider as one example student transcripts. These documents need to contain certain information and we know what that is. I would contend that given the potential of structured documents, we should overstructure so long as we can do it without increasing the burden of authorship. If we can structure our personal diaries such that they can serve as archival documents, we will have wasted our time with my diary, but the benefit of having the diary of Colin Powell in a structured form might prove invaluable.

All of this has ignored the telling of a good story. In 1956, I began the process of learning how to tell a structured story in writing. This process went on for 12 years of daily instruction through high school. It involved learning to diagram sentences, outline topics, write good paragraphs, etc. It continued more seriously, but in formal ways less regularly, through four years of college. Training then continued on the job – my first boss was a master writer of more than two dozen books. It also involved an intense period of mentored training as I wrote my dissertation. I believe that after 23 years, I had actually learned enough to consider myself a reasonably good writer of a structured story – my dissertation. Another ten years, including multiple articles, proposals, books, etc., brought me to a point where I consider myself roughly competent. Before we can learn to write good structured documents using XML and the tools that will emerge, we will need some significant training, beginning in grade school, about how to make the best use of these capabilities. For those of us now in our latter years, it is unlikely, even if we are good writers conceptually, that we will feel and embrace the potential of structured documents – just let me do it my own damn way.